
WIP: DecayRange #486

Draft

mjp41 wants to merge 2 commits into microsoft:main from mjp41:decayrange

Conversation

@mjp41
Member

@mjp41 mjp41 commented Mar 21, 2022

Implementation of a range that gradually releases memory back to
the OS. It pulls memory quickly, but dealloc_range caches the memory
locally and uses Pal timers to release it back to the next-level
range once sufficient time has passed.

  • Codify that the parent range needs to be concurrency safe.
  • Remove unused code.
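The decay mechanism described above can be sketched roughly as follows. This is a minimal illustration, not snmalloc's actual code: `Block`, `DecayCache`, `CountingParent`, and the tick-driven aging are all simplified stand-ins for the real range plumbing and Pal timer integration.

```cpp
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Illustrative sketch: dealloc_range() caches blocks locally, and a
// periodic timer tick ages the cache, pushing sufficiently old blocks
// back to the parent range.
struct Block
{
  void* base;
  size_t size;
  unsigned age = 0; // number of timer ticks survived in the cache
};

// Trivial stand-in parent for demonstration purposes.
struct CountingParent
{
  size_t released = 0;
  void* alloc_range(size_t size) { return ::operator new(size); }
  void dealloc_range(void* base, size_t)
  {
    ::operator delete(base);
    released++;
  }
};

template<typename Parent, unsigned DecayTicks = 2>
class DecayCache
{
  Parent& parent;
  std::vector<Block> cache;

public:
  explicit DecayCache(Parent& p) : parent(p) {}

  // Prefer the local cache; fall back to the parent range.
  std::pair<void*, size_t> alloc_range(size_t size)
  {
    for (size_t i = 0; i < cache.size(); i++)
    {
      if (cache[i].size == size)
      {
        Block b = cache[i];
        cache.erase(cache.begin() + i);
        return {b.base, b.size};
      }
    }
    return {parent.alloc_range(size), size};
  }

  // Cache locally instead of returning to the parent immediately.
  void dealloc_range(void* base, size_t size)
  {
    cache.push_back({base, size, 0});
  }

  // Called from a Pal-style timer: age entries, release old ones.
  void handle_decay_tick()
  {
    std::vector<Block> keep;
    for (Block& b : cache)
    {
      if (++b.age >= DecayTicks)
        parent.dealloc_range(b.base, b.size);
      else
        keep.push_back(b);
    }
    cache = std::move(keep);
  }

  size_t cached() const { return cache.size(); }
};
```

The key property is that a freed range survives at least `DecayTicks` timer intervals in the local cache, during which it can be reused without touching the parent.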

@mjp41 mjp41 force-pushed the decayrange branch 2 times, most recently from 5ad1005 to f7e897a on March 21, 2022 at 21:31

namespace snmalloc
{
template<typename Rep>
Contributor
Given the reuse of the large buddy range rep here, at least a comment (or a concept) might be in order.

}

// We have run out of memory.
handle_decay_tick(); // Try to free some memory.
Contributor

Does this need to be interlocked against the timer firing? I suppose not, due to the prepend-only nature of all_local, the read-only nature of the spine traversal, and the use of pop_all for each found sizeclass... assuming that the parent range doesn't need interlocking, which, by default anyway, it doesn't (specifically, the default parent is a CommitRange whose parent is a GlobalRange; CommitRange has no state of its own, and GlobalRange is an interlock).

The presumption that the parent is concurrency-safe might merit being written down somewhere?

Member Author

@mjp41 mjp41 commented Mar 22, 2022

Yeah, I plan to add a static constexpr to all the types for the concurrency-safety property, like currently happens with Align; I just hadn't threaded it through yet. So GlobalRange would be true, CommitRange would be whatever its parent says, and the buddy would be false.
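That plan can be sketched as follows. The names mirror the types mentioned above, but the structure is hypothetical, just to illustrate the constexpr-propagation idea; the real range types carry far more machinery.

```cpp
// Sketch of threading a ConcurrencySafe property through range types,
// analogous to how Align is propagated (structure is illustrative).

// Hypothetical leaf with unsynchronised state, for illustration.
struct UnsafeBase
{
  static constexpr bool ConcurrencySafe = false;
};

template<typename ParentRange>
struct CommitRange
{
  // Stateless: safe exactly when the parent is safe.
  static constexpr bool ConcurrencySafe = ParentRange::ConcurrencySafe;
};

template<typename ParentRange>
struct GlobalRange
{
  // Wraps the parent in a lock, so always safe.
  static constexpr bool ConcurrencySafe = true;
};

template<typename ParentRange>
struct LargeBuddyRange
{
  // Unsynchronised local state: never safe on its own.
  static constexpr bool ConcurrencySafe = false;
};

// A consumer that requires a concurrency-safe parent can then
// enforce the property at compile time:
template<typename ParentRange>
struct DecayRange
{
  static_assert(
    ParentRange::ConcurrencySafe,
    "DecayRange requires a concurrency-safe parent range");
};
```

With this in place, composing a DecayRange over a non-safe parent becomes a compile-time error rather than an undocumented assumption.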

@mjp41 mjp41 force-pushed the decayrange branch 6 times, most recently from f6254d6 to 93d6e3f on March 23, 2022 at 16:18
@mjp41
Member Author

mjp41 commented Mar 23, 2022

So the perf of this is okay, but it increases memory footprint for some examples too much. I have factored out the primary changes to enable this into #491, so that can be landed, and the perf of this can be fixed and landed at a later point.

@mjp41 mjp41 force-pushed the decayrange branch 4 times, most recently from 37a1ce8 to 9694c96 on March 23, 2022 at 20:57
@mjp41
Member Author

mjp41 commented May 13, 2022

This paper has a really interesting approach to work stealing of chunks between threads:
https://dl.acm.org/doi/10.1145/3533724

I think we could use some of the ideas in this paper to make the decay range perform better.

@SchrodingerZhu
Collaborator

BTW, I have recently worked on a Weak AVL Tree:

llvm/llvm-project#172411

which behaves in between an AVL tree and a red-black tree, adapting based on the insertion/deletion rate. If data structure performance is a concern, weak AVL may be worth a try.

Do we require pointer stability for the nodes? If not, a B-tree is almost always faster.

@mjp41
Member Author

mjp41 commented Feb 24, 2026

BTW, I have recently worked on a Weak AVL Tree:

llvm/llvm-project#172411

which behaves in between an AVL tree and a red-black tree, adapting based on the insertion/deletion rate. If data structure performance is a concern, weak AVL may be worth a try.

Do we require pointer stability for the nodes? If not, a B-tree is almost always faster.

Oh, that is really interesting. We have a lot of constraints on the red-black tree code, as it uses the pagemap as the storage for the nodes. This means a node can only use 16 bytes, and about four bits of that are already reserved.
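For illustration, that 16-byte constraint looks something like the following. The field names, bit positions, and metadata layout here are hypothetical; the actual snmalloc node layout differs.

```cpp
#include <cstdint>

// Illustrative sketch of a tree node that must fit in 16 bytes
// (two 64-bit words), where a few low bits of each word are reserved
// for metadata such as a red/black bit or a WAVL parity bit.
struct CompactNode
{
  uint64_t left;  // child address | low metadata bits
  uint64_t right; // child address | low metadata bits

  static constexpr uint64_t META_MASK = 0x3; // 2 reserved bits per word

  uint64_t left_child() const { return left & ~META_MASK; }
  uint64_t right_child() const { return right & ~META_MASK; }

  // Pack one balance bit (colour or rank parity) into the left word.
  void set_balance_bit(bool b)
  {
    left = (left & ~uint64_t{1}) | (b ? 1 : 0);
  }

  bool balance_bit() const { return (left & 1) != 0; }
};

static_assert(
  sizeof(CompactNode) == 16, "node must fit in the pagemap entry");
```

Because child addresses are sufficiently aligned, the low bits are always zero in a valid pointer, which is what makes this kind of in-pointer metadata packing possible.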

@SchrodingerZhu
Collaborator

Then WAVL should be a drop-in solution. There are two variants: one uses one bit to store rank parity, and the other uses two bits to store rank-difference flags. The second is a little faster, perhaps because it does not need to access a bit in the children to recover the two-bit information.
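The difference between the two encodings can be sketched like this. The code is illustrative only; the parity-to-difference recovery rule relies on the standard WAVL invariant that the rank difference between a node and each child is always 1 or 2.

```cpp
// 1-bit variant: each node stores only rank parity (rank mod 2).
// Since a child's rank differs from its parent's by 1 or 2:
//   same parity      => rank difference 2 (even)
//   different parity => rank difference 1 (odd)
// Recovering the difference therefore requires reading the child's bit.
inline int rank_diff_from_parity(bool parent_parity, bool child_parity)
{
  return parent_parity == child_parity ? 2 : 1;
}

// 2-bit variant: store both child rank-difference flags directly in the
// parent, so no child access is needed to recover the information.
struct RankDiffFlags
{
  unsigned char bits; // bit 0: left diff is 2; bit 1: right diff is 2

  int left_diff() const { return (bits & 1) ? 2 : 1; }
  int right_diff() const { return (bits & 2) ? 2 : 1; }
};
```

This also shows why the 1-bit variant fits the 16-byte pagemap node more comfortably: it consumes only one of the few reserved bits, at the cost of an extra child access during rebalancing.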

@mjp41
Member Author

mjp41 commented Feb 24, 2026

Then WAVL should be a drop-in solution. There are two variants: one uses one bit to store rank parity, and the other uses two bits to store rank-difference flags. The second is a little faster, perhaps because it does not need to access a bit in the children to recover the two-bit information.

This sounds really interesting. Do you have time to experiment with this for snmalloc? If not, would you be happy to submit an issue so we don't lose the idea?

@SchrodingerZhu
Collaborator

SchrodingerZhu commented Feb 25, 2026

This sounds really interesting. Do you have time to experiment with this for snmalloc? If not, would you be happy to submit an issue so we don't lose the idea?

I can have a try; at the least, we can have an AI agent port it to the bench first. Do you have specific instructions to replay a workload where rbtree consolidation is considered important?

@mjp41
Member Author

mjp41 commented Feb 25, 2026

This sounds really interesting. Do you have time to experiment with this for snmalloc? If not, would you be happy to submit an issue so we don't lose the idea?

I can have a try; at the least, we can have an AI agent port it to the bench first. Do you have specific instructions to replay a workload where rbtree consolidation is considered important?

This commit exercises the rbtree pretty heavily:

bf7a152

@SchrodingerZhu
Collaborator

SchrodingerZhu commented Feb 25, 2026

According to a preliminary test (with 4x the iterations of the original test file):

RBTree Replacement Benchmark Report (2026-02-25)

Scope

  • Tree backend variants:
    • 0: Red-Black tree (baseline)
    • 1: WAVL 2-bit diff
    • 2: WAVL 1-bit parity
  • Large alloc benchmark workload increased from 100000 to 400000 iterations (x4).
  • Repetitions increased to 10 runs per variant for statistics (mean/stddev/min/max).
  • Hyperfine also run with 10 repetitions.

Package + Remote Run

  • Archive: /tmp/snmalloc-rbtree-variants-20260225.zip
  • Uploaded to: spark:/tmp/snmalloc-rbtree-variants-20260225.zip
  • Remote workdir: /tmp/snmalloc-rbtree-variants-20260225

Environments

Local 10-Run Metric Stats (ns)

variant metric n mean_ns stddev_ns min_ns max_ns delta_vs_rb_ns delta_vs_rb_pct
rb alloc_dealloc 10 34553400.70 1310666.08 32439904 37218008 0.00 +0.00%
w2 alloc_dealloc 10 32305511.90 1422450.42 30445550 34602621 -2247888.80 -6.51%
w1 alloc_dealloc 10 30988844.30 1278594.28 28944532 33868143 -3564556.40 -10.32%
rb batch_alloc_dealloc 10 88992666.30 3468105.48 84547398 95170524 0.00 +0.00%
w2 batch_alloc_dealloc 10 68550446.70 2775490.79 64034170 71758820 -20442219.60 -22.97%
w1 batch_alloc_dealloc 10 69224794.00 1657977.26 66713419 71818604 -19767872.30 -22.21%
rb alloc_touch_dealloc 10 37034922.30 1461485.12 35165642 39571968 0.00 +0.00%
w2 alloc_touch_dealloc 10 33708889.80 1158505.79 31419122 35025455 -3326032.50 -8.98%
w1 alloc_touch_dealloc 10 32279160.80 1207555.98 30970969 34691720 -4755761.50 -12.84%

Local Hyperfine (10 runs)

Command Mean [ms] Min [ms] Max [ms] Relative
./build-rb/perf-large_alloc-fast 162.5 ± 4.8 157.3 169.7 1.03 ± 0.08
./build-w2/perf-large_alloc-fast 157.2 ± 10.9 148.8 182.5 1.00
./build-w1/perf-large_alloc-fast 176.3 ± 12.7 157.7 191.0 1.12 ± 0.11

Spark (aarch64) 10-Run Metric Stats (ns)

variant metric n mean_ns stddev_ns min_ns max_ns delta_vs_rb_ns delta_vs_rb_pct
rb alloc_dealloc 10 36912757.90 8786222.89 32479343 54836176 0.00 +0.00%
w2 alloc_dealloc 10 46133636.00 9692479.56 32085710 57150088 9220878.10 +24.98%
w1 alloc_dealloc 10 39381360.00 10398906.56 30973994 55649475 2468602.10 +6.69%
rb batch_alloc_dealloc 10 120928117.90 4419886.10 115008482 125408191 0.00 +0.00%
w2 batch_alloc_dealloc 10 88191151.10 204937.33 87790801 88468595 -32736966.80 -27.07%
w1 batch_alloc_dealloc 10 87219709.70 315244.24 86645261 87587249 -33708408.20 -27.87%
rb alloc_touch_dealloc 10 32501404.10 100263.29 32212927 32582160 0.00 +0.00%
w2 alloc_touch_dealloc 10 32077673.10 152537.46 31795581 32269678 -423731.00 -1.30%
w1 alloc_touch_dealloc 10 31466960.90 102171.68 31324316 31684637 -1034443.20 -3.18%

Spark Hyperfine (10 runs)

Command Mean [ms] Min [ms] Max [ms] Relative
./build-rb/perf-large_alloc-fast 192.3 ± 9.3 180.9 206.2 1.24 ± 0.09
./build-w2/perf-large_alloc-fast 156.5 ± 6.3 152.3 168.6 1.01 ± 0.07
./build-w1/perf-large_alloc-fast 155.2 ± 8.0 149.6 169.4 1.00

Notes

  • In-program metric timers (ns lines from perf-large_alloc-fast) and hyperfine wall-time can rank variants differently.
  • This report is post-fix and supersedes earlier numbers from intermediate iterations.

@SchrodingerZhu
Collaborator

While the data structure should be correctly implemented, the Codex-generated code appears very ad hoc, so I may need to craft this change by hand. Given that the 1-bit rank parity approach seems to be the most promising solution, I will retain just that single implementation.

3 participants